Air Quality Index Analysis by Binjie Lai

## [1] "/Users/BinjieLai/Documents/udacity/R/EDA_Course_Materials/Project 3 Binjie_lai_update"
##  [1] "Date"           "Year"           "Month"          "Day"           
##  [5] "PM2.5"          "PM10"           "CO"             "SO2"           
##  [9] "NO2"            "X1hO3"          "X8hO3"          "WD"            
## [13] "WS"             "Temp"           "RH"             "sea.level.pres"
## [17] "X6hr.precip"    "dewpoint"       "visibility"     "City"          
## [21] "lon"            "lat"
##       Date           Year          Month             Day       
##  1/1/14 :   3   Min.   :2013   Min.   : 1.000   Min.   : 1.00  
##  1/10/14:   3   1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00  
##  1/11/14:   3   Median :2013   Median : 7.000   Median :16.00  
##  1/12/14:   3   Mean   :2013   Mean   : 6.526   Mean   :15.72  
##  1/13/14:   3   3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00  
##  1/14/14:   3   Max.   :2014   Max.   :12.000   Max.   :31.00  
##  (Other):1077                                                  
##      PM2.5             PM10              CO           SO2         
##  Min.   :  5.42   Min.   : 14.65          :143   Min.   : 0.7249  
##  1st Qu.: 30.55   1st Qu.: 50.21   686    :  5   1st Qu.: 4.1049  
##  Median : 50.69   Median : 74.89   743    :  5   Median : 6.0504  
##  Mean   : 65.22   Mean   : 87.08   454    :  4   Mean   : 7.6950  
##  3rd Qu.: 83.00   3rd Qu.:108.70   498    :  4   3rd Qu.: 9.3388  
##  Max.   :391.63   Max.   :408.78   505    :  4   Max.   :45.6772  
##  NA's   :94       NA's   :152      (Other):930   NA's   :118      
##       NO2             X1hO3             X8hO3               WD          
##  Min.   : 3.256   Min.   :  4.285   Min.   :  2.738   Min.   :  0.3189  
##  1st Qu.:16.033   1st Qu.: 34.931   1st Qu.: 28.675   1st Qu.: 56.3967  
##  Median :20.881   Median : 50.094   Median : 42.684   Median :143.7384  
##  Mean   :23.321   Mean   : 55.672   Mean   : 46.078   Mean   :158.4243  
##  3rd Qu.:28.883   3rd Qu.: 71.025   3rd Qu.: 58.548   3rd Qu.:249.3315  
##  Max.   :66.551   Max.   :149.858   Max.   :134.037   Max.   :360.0000  
##  NA's   :119      NA's   :159       NA's   :159       NA's   :13        
##        WS              Temp              RH        sea.level.pres    
##  Min.   :0.0156   Min.   :-5.450   Min.   :11.60   Min.   :   11.00  
##  1st Qu.:0.9601   1st Qu.: 9.255   1st Qu.:54.77   1st Qu.:   96.62  
##  Median :1.5978   Median :18.937   Median :69.29   Median :  176.17  
##  Mean   :1.7865   Mean   :17.371   Mean   :66.39   Mean   :  248.21  
##  3rd Qu.:2.4521   3rd Qu.:25.725   3rd Qu.:79.71   3rd Qu.:  239.69  
##  Max.   :5.8587   Max.   :35.367   Max.   :98.35   Max.   :25795.38  
##  NA's   :13       NA's   :13       NA's   :13      NA's   :13        
##   X6hr.precip        dewpoint         visibility            City    
##  Min.   : 0.000   Min.   :-28.375   Min.   : 0.625   Beijing  :365  
##  1st Qu.: 0.000   1st Qu.:  1.121   1st Qu.: 9.565   Guangzhou:365  
##  Median : 0.000   Median : 12.314   Median :15.000   Shanghai :365  
##  Mean   : 1.146   Mean   : 10.040   Mean   :15.632                  
##  3rd Qu.: 0.300   3rd Qu.: 20.970   3rd Qu.:21.400                  
##  Max.   :40.000   Max.   : 26.648   Max.   :30.000                  
##  NA's   :13       NA's   :13        NA's   :13                      
##       lon             lat       
##  Min.   :23.12   Min.   :113.3  
##  1st Qu.:23.12   1st Qu.:113.3  
##  Median :31.25   Median :116.4  
##  Mean   :31.43   Mean   :117.1  
##  3rd Qu.:39.91   3rd Qu.:121.5  
##  Max.   :39.91   Max.   :121.5  
## 

This dataset contains air quality indices for three cities in China: Beijing, Shanghai and Guangzhou. The data was obtained from the website of the Ministry of Environmental Protection of China. All indices cover one year, from Mar 1 2013 to Feb 28 2014.

=============================================================================

Univariate Plots Section

In this section, I want to know some general information about the dataset.

Histogram of Visibility for All Cities

Visibility values are between 0 and 30.

Why does the maximum visibility of 30 have so many counts? Visibility is measured as the distance one can see in daytime. When it is higher than 30 km, it is recorded as 30, so many observations pile up at that value.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.625   9.565  15.000  15.630  21.400  30.000      13

Histogram of RH in Beijing

On most days, the relative humidity in Beijing is between 30 and 80.

Histogram of PM2.5 for All Cities

Most PM2.5 values are between 0 and 100. The mean PM2.5 is 65.22 and the median is 50.69.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    5.42   30.55   50.69   65.22   83.00  391.60      94

Contrast in Three Cities for Histogram of PM2.5

In general, Beijing has higher PM2.5 values than Guangzhou and Shanghai. This plot reflects the serious air pollution in Beijing.

New variable PM and its histogram for all cities

=============================================================================

Univariate Analysis

What is the structure of your dataset?

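It is a long-form dataset with 1,095 rows (365 days for each of the three cities) and 22 variables, including date fields, air quality indices and meteorological measurements.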

What is/are the main feature(s) of interest in your dataset?

The main features are PM2.5 and visibility. PM2.5 is drawing more and more attention in China as many cities suffer from it. Visibility is straightforward and usually relates to PM2.5. I’d like to find out how PM2.5 changes over time, and how PM2.5, visibility and the other indices relate to one another.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

All the air quality indices are important influencing factors. I will focus on PM10, relative humidity, wind speed, etc.

Did you create any new variables from existing variables in the dataset?

Yes. I created a variable for particulate matter (PM), as the mean of PM10 and PM2.5. PM10 and PM2.5 are both particulate matter but differ in particle diameter. PM is a convenient single variable to represent them.
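As an illustration of how such a combined variable could be built, here is a minimal sketch in base R. The data frame below is a made-up stand-in for the real data, and the treatment of missing values (PM is NA whenever either input is NA) is an assumption, not necessarily what was done in this project.

```r
# Toy stand-in for the project's data frame (values are made up).
bsg <- data.frame(
  PM2.5 = c(30, 80, NA, 120),
  PM10  = c(50, 110, 90, NA)
)

# PM as the row-wise mean of the two particulate measures.
# Assumption: with the default na.rm = FALSE, PM is NA whenever
# either PM2.5 or PM10 is missing on that row.
bsg$PM <- rowMeans(bsg[, c("PM2.5", "PM10")])

print(bsg)
```

Passing `na.rm = TRUE` to `rowMeans()` would instead fall back to whichever of the two values is available, which changes the meaning of PM on those rows.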

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I transformed the dataset into long form and made “City” a column. There were a few NA values in the dataset. I will filter them out where necessary in the following sections.

For example, to divide visibility into three ranges, (0, 10], (10, 20] and (20, 30], I need to filter out all the points where visibility is NA. I use the code “subset(bsg, !is.na(visibility))”.
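The filtering and binning step described above can be sketched as follows. The data frame name `bsg` comes from the code quoted in the text; the values and the new column name `vis_bin` are hypothetical.

```r
# Toy data: one missing visibility value (values are made up).
bsg <- data.frame(visibility = c(0.625, 10, 15, 21.4, 30, NA))

# Drop rows where visibility is missing, as in the text.
bsg_vis <- subset(bsg, !is.na(visibility))

# cut() uses right-closed intervals by default, which gives
# (0,10], (10,20], (20,30]. vis_bin is a hypothetical column name.
bsg_vis$vis_bin <- cut(bsg_vis$visibility, breaks = c(0, 10, 20, 30))

print(table(bsg_vis$vis_bin))
```

Because the intervals are closed on the right, a value of exactly 10 lands in the first bin and the capped value of 30 lands in the last one.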

=============================================================================

Bivariate Plots Section

Select key index as a new dataset & make scatterplot matrices

Visibility is higher from July to November. Summer has better visibility in China, which means better air quality.

Time Series Plot for Visibility for All Cities

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 13 rows containing missing values (stat_smooth).

Time Series Plot for PM2.5 for All Cities

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 94 rows containing missing values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_path).
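A time series view depends on the Date column being a real date rather than text. As a rough sketch, assuming the dates are written month/day/two-digit-year (as values such as "1/13/14" in the summary suggest), the column can be parsed and PM2.5 averaged by month. The data and the `YearMonth` column are made up for illustration.

```r
# Toy data in the same date style as the summary output (values are made up).
bsg <- data.frame(
  Date  = c("3/1/13", "3/2/13", "7/15/13", "1/10/14", "1/11/14"),
  PM2.5 = c(90, 110, 40, NA, 150)
)

# Assumption: dates are month/day/two-digit year.
bsg$Date <- as.Date(bsg$Date, format = "%m/%d/%y")

# Hypothetical helper column used only for grouping.
bsg$YearMonth <- format(bsg$Date, "%Y-%m")

# The formula interface of aggregate() drops rows with NA by default.
monthly <- aggregate(PM2.5 ~ YearMonth, data = bsg, FUN = mean)
print(monthly)
```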

PM2.5 is higher in winter than in summer. High PM2.5 means low air quality. I then make the scatterplot for visibility and PM2.5 and calculate the correlation coefficient. It is -0.656, well beyond the -0.3 threshold, which indicates a meaningful, moderately strong negative relationship between PM2.5 and visibility.

Scatterplot for PM2.5 and visibility for All Cities.

## Warning: Removed 107 rows containing missing values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  PM2.5 and visibility
## t = -27.3105, df = 986, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6903674 -0.6192317
## sample estimates:
##        cor 
## -0.6562553
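For readers who want to reproduce this kind of output, the following is a minimal sketch of a Pearson correlation test in base R. The data is simulated, so the numbers will not match the output above; only the column names follow the dataset.

```r
set.seed(1)

# Simulated data: visibility falls as PM2.5 rises and is capped at 30,
# loosely imitating the real dataset (values are made up).
pm25 <- runif(50, 5, 300)
visibility <- pmin(30, pmax(0.5, 25 - 0.07 * pm25 + rnorm(50, sd = 3)))
toy <- data.frame(PM2.5 = pm25, visibility = visibility)

# cor.test() uses Pearson's method by default and drops incomplete pairs.
res <- with(toy, cor.test(PM2.5, visibility))

print(res$estimate)   # correlation coefficient
print(res$conf.int)   # 95 percent confidence interval
print(res$parameter)  # degrees of freedom, n - 2
```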

I make the box plot for PM2.5 and visibility for all cities.

## Warning: Removed 94 rows containing non-finite values (stat_boxplot).

Remove the outliers.

Higher PM2.5 is associated with lower visibility.

## Warning: Removed 123 rows containing non-finite values (stat_boxplot).

I want to know more about visibility. What other features may also influence it? I’d like to try PM10, RH, WS and Temp. To make the analysis more accurate, I focus only on Beijing.

Here are the correlation coefficients.

PM10 vs. visibility: -0.53

RH vs. visibility: -0.35

WS vs. visibility: 0.21

Temp vs. visibility: 0.27

From the results above, PM10 and PM2.5 have a strong correlation with visibility, while relative humidity, wind speed and temperature have a small or non-meaningful correlation with visibility.

## Warning: Removed 57 rows containing missing values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  PM10 and visibility
## t = -19.284, df = 929, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5790216 -0.4871461
## sample estimates:
##        cor 
## -0.5346619
## 
##  Pearson's product-moment correlation
## 
## data:  RH and visibility
## t = -12.3324, df = 1080, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4025073 -0.2979811
## sample estimates:
##        cor 
## -0.3513385
## 
##  Pearson's product-moment correlation
## 
## data:  WS and visibility
## t = 7.1959, df = 1080, p-value = 1.159e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1562920 0.2700506
## sample estimates:
##       cor 
## 0.2138964
## 
##  Pearson's product-moment correlation
## 
## data:  Temp and visibility
## t = 9.3492, df = 1080, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2175805 0.3278789
## sample estimates:
##      cor 
## 0.273629
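Running the same test for several variables can be condensed into a loop. The sketch below computes the correlation of each candidate variable with visibility on simulated data; the variable names follow the dataset, everything else is made up.

```r
set.seed(2)
n <- 60

# Simulated predictors (values are made up).
toy <- data.frame(
  PM10 = runif(n, 15, 400),
  RH   = runif(n, 10, 100),
  WS   = runif(n, 0, 6),
  Temp = runif(n, -5, 35)
)
toy$visibility <- pmin(30, pmax(0.5,
  28 - 0.05 * toy$PM10 - 0.05 * toy$RH + rnorm(n, sd = 4)))
toy$RH[3] <- NA  # one missing value, to show how it is handled

vars <- c("PM10", "RH", "WS", "Temp")

# Pearson correlation of each variable with visibility,
# using only the rows where both values are present.
cors <- sapply(vars, function(v) {
  cor(toy[[v]], toy$visibility, use = "complete.obs")
})

print(round(cors, 2))
```

Collecting the coefficients in one named vector makes it easier to compare signs and magnitudes side by side than reading separate test outputs.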

Out of curiosity, I wonder whether strong wind can blow PM2.5 away and result in lower pollution. So I test the relationship between PM2.5 and wind speed. The correlation coefficient is -0.25, which does not reach the -0.3 threshold. It seems that PM2.5 has little linear relationship with wind speed.

## Warning: Removed 107 rows containing missing values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  WS and PM2.5
## t = -8.0374, df = 986, p-value = 2.61e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3056120 -0.1885168
## sample estimates:
##        cor 
## -0.2479699

=============================================================================

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

First, I plot the scatterplot matrices for the six key indices, to see the relationship between each pair.

Then I make time series plots of visibility and PM2.5, the features of interest. It turns out that summer has better air quality, with higher visibility and lower PM2.5.

Finally, I try to identify the factors that influence visibility. I have analyzed PM2.5, PM10, relative humidity, wind speed and temperature. PM2.5 and PM10 stand out and show their effect on visibility. Relative humidity also seems to be related to visibility, and I will look into it in the next section.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I check the relationship between PM2.5 and wind speed. It seems that when wind speed increases, a PM2.5-polluted day is less likely. Strong wind blows fine particulate matter away!

What was the strongest relationship you found?

Visibility has a strong relationship with PM2.5 and PM10, but the relationships are not linear. Visibility has less variance when PM2.5 and PM10 are low.

=============================================================================

Multivariate Plots Section

Heat Map for Visibility for all Cities.

Visibility from March to September is clearly better than from September to the following March.

From the two plots below, winter’s influence on air quality only shows in Beijing, where both visibility and PM2.5 indicate poorer air quality in winter. In Guangzhou, however, the two indices stay stable throughout the year. In Shanghai, PM2.5 is stable while visibility peaks in summer. I guess this is because Shanghai is usually stormy in summer, and visibility usually increases after heavy rain.

Time Series Plot for Visibility in Three Cities

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 7 rows containing missing values (stat_smooth).
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 3 rows containing missing values (stat_smooth).
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 3 rows containing missing values (stat_smooth).

Time Series Plot for PM2.5 in Three Cities

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 23 rows containing missing values (stat_smooth).
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 45 rows containing missing values (stat_smooth).
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 26 rows containing missing values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_path).

I also want to double-check the relationship between wind speed and visibility. Is it also true only for Beijing?

It turns out that in Guangzhou and Shanghai, which have better air quality, wind speed has little influence on PM2.5 or visibility. However, Beijing is more polluted, and when wind speed increases there, a PM2.5-polluted day is less likely. Strong wind only works in Beijing!

Besides, the correlation coefficients between wind speed and visibility for the three cities are 0.32, 0.089 and 0.23. Therefore there is no clear linear relationship between them.

## Warning: Removed 30 rows containing missing values (geom_point).
## Warning: Removed 48 rows containing missing values (geom_point).
## Warning: Removed 29 rows containing missing values (geom_point).

## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 3 rows containing missing values (geom_point).

## [1] 0.3247127
## [1] 0.08935503
## [1] 0.2317084
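Per-city coefficients like the three above can be computed in one pass by splitting the data on City. This is a sketch on simulated data in which only Beijing is given a wind effect; it illustrates the technique and does not reproduce the values above.

```r
set.seed(3)

# Simulated data: 40 days per city (values are made up).
toy <- data.frame(
  City = rep(c("Beijing", "Guangzhou", "Shanghai"), each = 40),
  WS   = runif(120, 0, 6)
)

# Assumption for illustration: wind speed raises visibility in Beijing only.
toy$visibility <- pmin(30, pmax(0.5,
  12 + rnorm(120, sd = 5) + ifelse(toy$City == "Beijing", 2 * toy$WS, 0)))

# One correlation coefficient per city.
by_city <- sapply(split(toy, toy$City), function(d) {
  cor(d$WS, d$visibility, use = "complete.obs")
})

print(round(by_city, 3))
```

The result is a named vector, so each coefficient is labelled with its city, which avoids having to remember the order of three unlabelled numbers.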

Therefore, I guess that meteorological indices such as relative humidity and wind speed have less effect on visibility, while environmental indices such as particulate matter have a strong effect on it. The dataset has two more meteorological indices, sea level pressure and dewpoint, and two more environmental indices, NO2 and SO2. I will use these four additional indices later, in the summary section, to show that environmental indices are more significant than meteorological ones.

============================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

First, different cities have different visibility patterns. Analysis of other features must be done city by city. Combining the data from all the cities may lead to wrong conclusions.

In general, air quality is better in summer than in winter.

Were there any interesting or surprising interactions between features?

Yes. I used to believe that strong wind always brings clean air and must be the most important factor. However, after investigating the relationship between wind speed and visibility, I realize that wind speed only matters in some areas. In other places, other factors may be more significant than wind speed.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

=============================================================================

Final Plots and Summary

Plot One

Heat Map for Visibility for Three Cities.

Description One

First, in general, visibility from March to September is better than from September to the following March. Second, the most polluted days with low visibility appear in Beijing. Third, visibility changes more dramatically in Beijing because it is more sensitive to natural conditions such as wind speed.

=============================================================================

Plot Two

## Warning: Removed 94 rows containing missing values (geom_point).
## Warning: Removed 23 rows containing missing values (geom_point).
## Warning: Removed 45 rows containing missing values (geom_point).
## Warning: Removed 26 rows containing missing values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  PM2.5 and visibility
## t = -27.3105, df = 986, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6903674 -0.6192317
## sample estimates:
##        cor 
## -0.6562553
## 
##  Pearson's product-moment correlation
## 
## data:  PM2.5 and log(visibility)
## t = -37.4783, df = 986, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7910712 -0.7395061
## sample estimates:
##        cor 
## -0.7665213
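The two tests above differ only in whether visibility is log-transformed. As a rough sketch of why the transform can strengthen a Pearson correlation, the simulated data below assumes visibility decays roughly exponentially with PM2.5. That functional form is an assumption made for illustration, not a finding of this analysis.

```r
set.seed(4)

# Simulated data (values are made up).
# Assumption: visibility decays exponentially with PM2.5,
# with multiplicative noise.
pm25 <- runif(200, 5, 390)
visibility <- 30 * exp(-0.01 * pm25 + rnorm(200, sd = 0.2))

r_raw <- cor(pm25, visibility)       # linear fit to a curved relationship
r_log <- cor(pm25, log(visibility))  # the log makes the relationship linear

print(c(raw = r_raw, log = r_log))
```

Pearson's coefficient measures linear association, so straightening a curved relationship with a log transform tends to move the coefficient further from zero, as it does in the output above (-0.66 against -0.77).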

Description Two

Plot Three

## Warning: Removed 22 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning: Removed 25 rows containing non-finite values (stat_boxplot).
## Warning: Removed 9 rows containing non-finite values (stat_boxplot).
## Warning: Removed 15 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

Description Three

Based on the plot above, the indices show greater variance when visibility is low, and the variance decreases as visibility increases. On good days with high visibility, all the environmental indices perform well and stay at low levels. However, bad air quality with low visibility may involve more complex factors that result in pollution. Therefore the variance is greater on low-visibility days.


Reflection

At the start of the analysis, I didn’t know how to connect the different features. I then decided to use visibility as the basic index because it is very straightforward. I also had trouble analyzing the bivariate plots. They seemed to show no clear relationship! Calculating correlation coefficients helped a lot. However, a linear fit cannot always represent the relationship in the best way, so it would be better to find other statistical methods for the analysis. My last struggle was adjusting the size of the plots in HTML. I still can’t figure it out…

For future work, I would be interested in constructing a model for visibility. I do think more features should be included in the model, to account for the different situations in different places. In addition, one year of data is not enough. Data covering a longer time range, such as 5 to 10 years, or data with more detail, such as hourly measurements throughout the day, will be very important for analyzing visibility.